Analysis of COVID-19 in Utah

Overview of COVID-19 Data in Utah by County

Analysis Objective:

The objective of this analysis is to examine the relationships between the demographics of Utah counties and the number of COVID-19 cases, the number of deaths, and the prevalence of mask wearing in those counties.

Select Data:

All data points are from Utah and cover all of its counties. The data includes demographics (median age, population, race) for each county along with COVID-19 data: case counts, death counts, and the share of respondents in each county who reported wearing a mask never, rarely, sometimes, frequently, or always. In total, there are 7317 observations of 52 variables.
This data was compiled by The New York Times and is available at https://github.com/nytimes/covid-19-data/tree/f7dd58814ab56a507ce01741157772f613840671

Data Analysis:

Unsupervised Analysis: Kmeans Clustering

COVID-19 Cases and Prevalence of Mask Wearing by County - Overview
##     cases NEVER RARELY SOMETIMES FREQUENTLY ALWAYS
## 311     1 0.028  0.032     0.094      0.202  0.644
## 331     1 0.028  0.032     0.094      0.202  0.644
## 351     1 0.028  0.032     0.094      0.202  0.644
## 372     1 0.028  0.032     0.094      0.202  0.644
## 394     1 0.028  0.032     0.094      0.202  0.644
## 425     1 0.028  0.032     0.094      0.202  0.644

Summary of Covid Data

#unsupervised
#remove nas 

summary(complete_Covid)
##      cases           NEVER             RARELY          SOMETIMES     
##  Min.   :    1   Min.   :0.00200   Min.   :0.02300   Min.   :0.0330  
##  1st Qu.:   14   1st Qu.:0.04000   1st Qu.:0.04300   1st Qu.:0.0720  
##  Median :  103   Median :0.06800   Median :0.06400   Median :0.0980  
##  Mean   : 1703   Mean   :0.09171   Mean   :0.09169   Mean   :0.1162  
##  3rd Qu.:  650   3rd Qu.:0.09900   3rd Qu.:0.11400   3rd Qu.:0.1410  
##  Max.   :74269   Max.   :0.43200   Max.   :0.29600   Max.   :0.2710  
##    FREQUENTLY         ALWAYS      
##  Min.   :0.1710   Min.   :0.1750  
##  1st Qu.:0.2120   1st Qu.:0.3530  
##  Median :0.2690   Median :0.4210  
##  Mean   :0.2691   Mean   :0.4312  
##  3rd Qu.:0.3000   3rd Qu.:0.5180  
##  Max.   :0.4690   Max.   :0.6510

Plotting the data to find the best number of clusters for the analysis

The elbow in the plot is at 5, so I created groupings based on 5 clusters.
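
The elbow check can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the plot in this report: the matrix `toy` is a hypothetical stand-in for complete_Covid, and the range of k values is an assumption.

```r
# Hypothetical stand-in for complete_Covid: three well-separated groups
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With the real data, the same loop would run over complete_Covid and the bend would be read off the resulting curve.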

I created 5 clusters with 25 random starting assignments (nstart = 25).
To get a good overall view of the clustering, I aggregated the results to show the mean of each variable for each cluster.
I did not print the original cluster assignments because the integer vector giving the cluster assignment of every data point is very large.
CClusters <- kmeans(complete_Covid, 5, nstart = 25)

aggregate(complete_Covid, by=list(cluster=CClusters$cluster),mean)
##   cluster      cases      NEVER     RARELY SOMETIMES FREQUENTLY    ALWAYS
## 1       1   364.0402 0.09606134 0.09584397 0.1163072  0.2706973 0.4210397
## 2       2 37829.0492 0.04373770 0.03318033 0.1337377  0.2185246 0.5708197
## 3       3 21819.3361 0.04227049 0.03362295 0.1291803  0.2195574 0.5753443
## 4       4  6778.7797 0.05196203 0.05896456 0.1097646  0.2715291 0.5074532
## 5       5 61329.0870 0.02800000 0.03200000 0.0940000  0.2020000 0.6440000
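
The cluster means above differ mostly in cases, which is expected: k-means relies on Euclidean distance, and cases ranges from 1 to 74269 while the mask-use columns all lie between 0 and 1. One possible variation, sketched here on hypothetical data and not the approach taken in this report, is to standardize the columns with scale() before clustering so that each variable contributes on a comparable footing.

```r
set.seed(1)
# Hypothetical toy data: one large-range column and one 0-1 share column
toy <- data.frame(
  cases  = c(rnorm(50, 500, 100), rnorm(50, 20000, 2000)),
  ALWAYS = runif(100, 0.2, 0.65)
)

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# Cluster on the standardized values
scaled_clusters <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Report cluster means on the original scale for readability
aggregate(toy, by = list(cluster = scaled_clusters$cluster), mean)
```

Whether standardizing is appropriate depends on whether the goal is to group counties mainly by case volume or by mask-wearing profile.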
At this point, I added the county name to each observation alongside its cluster assignment.
I show the last 6 rows of this data rather than the first 6 because they cover a larger variety of counties.
cluster_w_county <- cbind(complete_county, cluster = CClusters$cluster)

#head(cluster_w_county)
tail(cluster_w_county)
##        cases NEVER RARELY SOMETIMES FREQUENTLY ALWAYS                  County
## 775630   666 0.121  0.269     0.046      0.389  0.175     Uintah County, Utah
## 775634 45297 0.068  0.035     0.195      0.244  0.458       Utah County, Utah
## 775635  1927 0.039  0.063     0.056      0.254  0.588    Wasatch County, Utah
## 775636  8563 0.034  0.056     0.123      0.269  0.518 Washington County, Utah
## 775637    44 0.068  0.129     0.114      0.269  0.420      Wayne County, Utah
## 775638 11586 0.066  0.089     0.072      0.287  0.486      Weber County, Utah
##        cluster
## 775630       1
## 775634       2
## 775635       1
## 775636       4
## 775637       1
## 775638       4

Supervised: Estimation - Regression to explain the variability in COVID-19 cases

For this model, I wanted to include more variables in addition to the COVID-19 data, so I added demographic data for all Utah counties, including race, income, population, smoking, obesity, diabetes, median age, and the uninsured.
I then fit a linear regression model to see which variables are most significant.
The residuals range from -9096.4 to 21112.7 while the middle half lie between -217.2 and 104.7, so the model needs further exploration to bring the residuals more in line with one another. I ran a few variations on this model before settling on the current one.
The Multiple R-squared (0.9063) and Adjusted R-squared (0.906) are both around 0.90, which indicates that the model fits the data well.
The F-statistic is also high (2362 on 22 and 5370 degrees of freedom) and its p-value is low (< 2.2e-16).
The model explains about 91% of the variability in cases of COVID-19.
## 
## Call:
## lm(formula = cases ~ ., data = int_small_decision_tree)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9096.4  -217.2    -6.7   104.7 21112.7 
## 
## Coefficients: (1 not defined because of singularities)
##                                                              Estimate
## (Intercept)                                                -1.207e+06
## deaths                                                      1.654e+02
## NEVER                                                       8.300e+05
## RARELY                                                      8.657e+05
## SOMETIMES                                                   7.968e+05
## FREQUENTLY                                                  8.006e+05
## ALWAYS                                                      8.204e+05
## Less.Than.High.School                                      -8.917e+02
## At.Least.High.School.Diploma                                       NA
## At.Least.Bachelor.s.Degree                                  5.179e+02
## Graduate.Degree                                            -4.668e+02
## School.Enrollment                                          -4.433e+01
## Median.Earnings.2010.dollars                               -5.116e-01
## White.not.Latino.Population                                 3.969e+03
## African.American.Population                                 7.727e+03
## Native.American.Population                                  4.060e+03
## Asian.American.Population                                  -1.869e+03
## Population.some.other.race.or.races                         4.464e+03
## Latino.Population                                           3.957e+03
## Total.Population                                            7.104e-03
## Construction.extraction.maintenance.and.repair.occupations  4.022e+01
## median_age                                                  1.387e+02
## Adult.smoking                                               9.806e+03
## Adult.obesity                                               1.940e+04
##                                                            Std. Error t value
## (Intercept)                                                 2.825e+05  -4.273
## deaths                                                      1.007e+00 164.270
## NEVER                                                       7.648e+05   1.085
## RARELY                                                      7.740e+05   1.118
## SOMETIMES                                                   7.399e+05   1.077
## FREQUENTLY                                                  7.600e+05   1.053
## ALWAYS                                                      7.641e+05   1.074
## Less.Than.High.School                                       9.436e+02  -0.945
## At.Least.High.School.Diploma                                       NA      NA
## At.Least.Bachelor.s.Degree                                  1.204e+02   4.303
## Graduate.Degree                                             7.841e+01  -5.954
## School.Enrollment                                           2.075e+01  -2.137
## Median.Earnings.2010.dollars                                3.356e-01  -1.524
## White.not.Latino.Population                                 7.026e+03   0.565
## African.American.Population                                 9.575e+03   0.807
## Native.American.Population                                  6.926e+03   0.586
## Asian.American.Population                                   7.427e+03  -0.252
## Population.some.other.race.or.races                         5.625e+03   0.794
## Latino.Population                                           6.855e+03   0.577
## Total.Population                                            1.656e-03   4.290
## Construction.extraction.maintenance.and.repair.occupations  1.488e+02   0.270
## median_age                                                  2.770e+02   0.501
## Adult.smoking                                               2.066e+04   0.475
## Adult.obesity                                               8.064e+03   2.406
##                                                            Pr(>|t|)    
## (Intercept)                                                1.96e-05 ***
## deaths                                                      < 2e-16 ***
## NEVER                                                        0.2778    
## RARELY                                                       0.2634    
## SOMETIMES                                                    0.2815    
## FREQUENTLY                                                   0.2922    
## ALWAYS                                                       0.2830    
## Less.Than.High.School                                        0.3447    
## At.Least.High.School.Diploma                                     NA    
## At.Least.Bachelor.s.Degree                                 1.72e-05 ***
## Graduate.Degree                                            2.79e-09 ***
## School.Enrollment                                            0.0327 *  
## Median.Earnings.2010.dollars                                 0.1275    
## White.not.Latino.Population                                  0.5722    
## African.American.Population                                  0.4197    
## Native.American.Population                                   0.5577    
## Asian.American.Population                                    0.8013    
## Population.some.other.race.or.races                          0.4274    
## Latino.Population                                            0.5638    
## Total.Population                                           1.82e-05 ***
## Construction.extraction.maintenance.and.repair.occupations   0.7869    
## median_age                                                   0.6165    
## Adult.smoking                                                0.6351    
## Adult.obesity                                                0.0162 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2039 on 5370 degrees of freedom
## Multiple R-squared:  0.9063, Adjusted R-squared:  0.906 
## F-statistic:  2362 on 22 and 5370 DF,  p-value: < 2.2e-16
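
The note "1 not defined because of singularities" and the NA row for At.Least.High.School.Diploma indicate that this predictor is an exact linear combination of other predictors. A plausible cause, which is an assumption and not confirmed by the data shown here, is that it is the complement of Less.Than.High.School. The sketch below reproduces that situation on hypothetical data and uses alias() to identify the redundant term.

```r
set.seed(7)
n <- 200
less_than_hs <- runif(n, 2, 20)       # hypothetical percentage
at_least_hs  <- 100 - less_than_hs    # exact complement (assumed relationship)
population   <- runif(n, 1000, 100000)
cases        <- 0.05 * population + 30 * less_than_hs + rnorm(n, sd = 50)

toy <- data.frame(cases, less_than_hs, at_least_hs, population)
fit <- lm(cases ~ ., data = toy)

# The redundant predictor gets an NA coefficient
coef(fit)

# alias() reports the aliased term and the terms it depends on
alias(fit)$Complete
```

Dropping one of the two complementary columns before fitting would remove the NA row without changing the fitted values.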

Decision Tree - Regression Tree

I wanted to find local effects in the data and improve the model by creating a Decision Tree. The data is broken out by county in each of the models, so I can see how cases are affected by each variable.
I knew from the linear model that deaths was one of its important variables, which is not that surprising. I also wanted to see what else would show up in the Decision Tree model.
Across all Utah counties, the observations are first split on the number of deaths. From there they are divided further by number of deaths, and only then do demographic variables begin to have an influence. These are not necessarily the variables marked as significant in the regression model: Less.Than.High.School appears in the tree but was not significant in the regression.
## n= 4313 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4313 168233800000  1977.8020  
##    2) deaths< 65.5 4139  18590430000   934.7949  
##      4) deaths< 30.5 3921   3487652000   538.8059  
##        8) deaths< 5.5 3268    420548600   236.7583 *
##        9) deaths>=5.5 653   1276842000  2050.4320 *
##      5) deaths>=30.5 218   3429258000  8057.1470 *
##    3) deaths>=65.5 174  38034090000 26788.1800  
##      6) deaths< 270.5 140  12725850000 21324.3600  
##       12) Less.Than.High.School>=8 106   7987302000 18537.6300  
##         24) deaths< 139 40    519388700  9044.7250 *
##         25) deaths>=139 66   1678689000 24290.9100 *
##       13) Less.Than.High.School< 8 34   1348991000 30012.3800 *
##      7) deaths>=270.5 34   3919152000 49286.2900  
##       14) deaths< 316 21    461526400 42037.9000 *
##       15) deaths>=316 13    572013800 60995.2300 *

I also created a second Decision Tree that does not include the number of deaths. Since a tree shows local effects such as breakpoints and interactions, I could create another regression that incorporates these breakpoints. This second tree illustrates how the different demographics and locations change the case numbers in each county.
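# Hold out roughly 20% of the 5393 rows for testing
train_sample2 <- sample(5393, 4313)
covid_train2 <- no_death_covid[train_sample2, ]
covid_test2 <- no_death_covid[-train_sample2, ]

# Fit a regression tree for cases on all remaining predictors
library(rpart)
covid_tree2 <- rpart(cases ~ ., data = covid_train2)
covid_tree2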
## n= 4313 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 4313 177612900000  2045.5610  
##   2) county=Box Elder,Cache,Carbon,Davis,Duchesne,Emery,Grand,Iron,Juab,Kane,Millard,San Juan,Sanpete,Sevier,Summit,Tooele,Uintah,Wasatch,Washington,Weber 3887   9889306000   718.8899  
##     4) Total.Population< 83673.5 3061    412374800   256.7994 *
##     5) Total.Population>=83673.5 826   6401176000  2431.3100 *
##   3) county=Salt Lake,Utah 426  98459070000 14150.6500  
##     6) School.Enrollment>=76.05 205  26721000000 10472.7900 *
##     7) School.Enrollment< 76.05 221  66392880000 17562.2500 *
options(scipen=999)
rpart.plot(covid_tree2, type = 4, fallen.leaves = TRUE, extra = 101)
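
The held-out set covid_test2 is created above but not used in the output shown. A minimal sketch of how a held-out set could be used to check a regression tree follows. It runs on hypothetical data (the `toy` data frame stands in for no_death_covid), and the choice of mean absolute error and correlation as checks is an assumption, not something taken from this report.

```r
library(rpart)

set.seed(123)
n <- 500
# Hypothetical stand-in for no_death_covid
toy <- data.frame(
  Total.Population  = runif(n, 1000, 1000000),
  School.Enrollment = runif(n, 60, 90)
)
toy$cases <- 0.02 * toy$Total.Population + rnorm(n, sd = 500)

# Roughly 80/20 split, mirroring the split above
train_idx <- sample(n, 400)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

toy_tree <- rpart(cases ~ ., data = toy_train)

# Predict on the held-out rows and compare with the observed counts
pred     <- predict(toy_tree, newdata = toy_test)
mae      <- mean(abs(pred - toy_test$cases))
cor_pred <- cor(pred, toy_test$cases)
mae
cor_pred
```

The same two lines of prediction and comparison would apply to covid_tree2 and covid_test2.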

Apply Analysis

According to the linear regression model, the most significant variables were deaths, Education: At Least Bachelor’s Degree, Education: Graduate Degree, and Total Population, followed by School Enrollment and Adult Obesity at a weaker significance level. Apart from deaths, education was the most prominent group of variables in this particular model.
I also built a regression tree so that I could see the interaction points within the data. Through this tree model, I could see interactions that were not visible in the regression model.
Combining what these models have shown, I can see which data points matter for understanding the influence of different demographics and mask use on the number of cases in different Utah counties. I can see the importance of looking into different demographics, seeing how COVID-19 is affecting those populations, and using this information to better serve them. I can also see from the cluster means how reported mask use varies with the number of cases in each county, although the mask-use variables were not individually significant in the regression.

Deploy Model

I can use the cluster groups, the significant variables from the regression, and the significant breakpoints from the Decision Tree to create an action plan for outreach within the community.

Assess Results

These models definitely need to be re-evaluated to focus on the most important variables. Even though the regression model had a high R-squared, I believe that an alternative model that removes the ‘deaths’ variable might surface a different set of variables to focus on when analyzing cases of COVID-19 in Utah.
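
Refitting without deaths can be sketched with the `. - deaths` formula syntax. The data below is hypothetical and built so that deaths dominates the fit; it only illustrates the mechanics of comparing the two models, not the result that would come from the Utah data.

```r
set.seed(11)
n <- 300
# Hypothetical county-level data
toy <- data.frame(
  Total.Population = runif(n, 1000, 500000),
  Adult.obesity    = runif(n, 0.15, 0.35)
)
toy$deaths <- rpois(n, lambda = toy$Total.Population / 5000)
toy$cases  <- 150 * toy$deaths + rnorm(n, sd = 300)

# Model with every predictor, and the same model with deaths removed
full_fit     <- lm(cases ~ ., data = toy)
no_death_fit <- lm(cases ~ . - deaths, data = toy)

# Compare how much variability each model explains
summary(full_fit)$adj.r.squared
summary(no_death_fit)$adj.r.squared

# See which predictors stand out once deaths is gone
summary(no_death_fit)$coefficients
```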

Strengths of Kmeans Clustering, Decision Tree, and Regression Analysis

Using a combination of these models allows the model builder to check and change how the model is built. Looking at the clusters that the data falls into illustrates how the data can be separated into categories. From there, I can build the regression model, see the significance of each variable, and attempt to create a better model from the significant ones.
Finally, by following that model with a Decision Tree, I can see how the variables interact and where their breakpoints are. Given this information, which was not specified in the regression model, I can go back and alter the regression model to reflect the newly discovered interactions.
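
One way to carry a tree breakpoint back into a regression is to add an indicator for the split and let it interact with the split variable. The sketch below uses the first split of the deaths tree (deaths < 65.5) as the breakpoint. The data is hypothetical, and the indicator name `high_deaths` is introduced only for illustration.

```r
set.seed(5)
n <- 400
deaths <- round(runif(n, 0, 150))
high   <- deaths >= 65.5

# Hypothetical data where the level and slope change above the breakpoint
cases <- ifelse(high, 5000 + 200 * deaths, 100 * deaths) + rnorm(n, sd = 400)
toy   <- data.frame(cases, deaths, high_deaths = high)

# Plain regression on deaths
plain_fit <- lm(cases ~ deaths, data = toy)

# Indicator for the tree split plus its interaction with deaths
break_fit <- lm(cases ~ deaths * high_deaths, data = toy)

# Compare the two nested models
anova(plain_fit, break_fit)
```

If the breakpoint matters, the second model should leave noticeably less residual variation than the first.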
Using a combination of all of these model types allows the model builder to plan, build, discover, and then plan, build, discover again until a more effective model is ready to be operationalized.